Pose estimation is usually tackled as either a bin-classification problem or a regression problem. In both cases, the idea is to directly predict the pose of an object. This is a non-trivial task because of appearance variations between similar poses and appearance similarities between dissimilar ones. Instead, we follow the key idea that it is easier to compare two poses than to estimate them. Render-and-compare approaches have been employed to this end, but they tend to be unstable, computationally expensive, and too slow for real-time applications. We propose performing category-level pose estimation by learning an alignment metric with a dynamic margin and a continuous pose-label space. For efficient inference, we use a simple real-time image-retrieval scheme against a reference set of renderings projected into the embedding space. To achieve robustness to real-world conditions, we employ synthetic occlusions, bounding-box perturbations, and appearance augmentations. Our approach achieves state-of-the-art performance on Pascal3D and OccludedPascal3D, as well as high-quality results on KITTI3D.
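To make the retrieval formulation concrete, here is a minimal sketch, assuming pose labels are unit quaternions and embeddings come from an already-trained encoder; the particular loss form, similarity threshold, and margin scale below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def geodesic_angle(q1, q2):
    """Angular distance (radians) between unit quaternions of shape (N, 4)."""
    dot = (q1 * q2).sum(dim=-1).abs().clamp(max=1.0)
    return 2.0 * torch.acos(dot)

def dynamic_margin_loss(emb_a, emb_b, pose_a, pose_b, alpha=1.0, thresh=0.1):
    """Contrastive alignment loss whose margin scales with the continuous pose
    distance: nearby poses are pulled together in embedding space, distant
    poses are pushed apart by a pose-dependent (dynamic) margin."""
    d_emb = F.pairwise_distance(emb_a, emb_b)
    d_pose = geodesic_angle(pose_a, pose_b)
    margin = alpha * d_pose
    similar = (d_pose < thresh).float()
    return (similar * d_emb.pow(2)
            + (1 - similar) * F.relu(margin - d_emb).pow(2)).mean()

def retrieve_pose(query_emb, ref_embs, ref_poses):
    """Assign the query the pose of its nearest reference rendering in
    embedding space (the simple retrieval scheme used at inference time)."""
    idx = torch.cdist(query_emb[None], ref_embs).argmin(dim=-1).squeeze(0)
    return ref_poses[idx]
```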
Visual localization systems for 6-DoF pose estimation leverage principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images against a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and performance. However, despite gains in typical recall@0.25m style metrics, these systems still have limited utility for real-world applications such as autonomous vehicles because of their "worst-case" areas of performance, i.e. the locations where they provide insufficient recall. Here, we investigate the utility of "place-specific configurations", where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate improved worst-case localization performance compared to using an off-the-shelf pipeline, minimizing the percentage of the dataset that falls outside a given error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, in which an AV fleet regularly traverses known routes.
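As a loose illustration of a place-specific configuration (the place names, camera identifiers, and the `estimate_pose` hook below are hypothetical, not the paper's format), the idea reduces to a per-place lookup that decides which camera of the rig feeds the standard localization pipeline:

```python
# Hypothetical sketch: map segments ("places") carry their own configuration,
# here simply which camera of the multi-camera rig to use for pose estimation.
PLACE_TO_CAMERA = {
    "place_000": "front_left",
    "place_001": "front_right",
    "place_002": "rear",
}

def localize(images_by_camera, place_id, estimate_pose, default="front_left"):
    """Select the camera dictated by the current place, then run the usual
    hierarchical localization pipeline on that camera's image only."""
    camera = PLACE_TO_CAMERA.get(place_id, default)
    return estimate_pose(images_by_camera[camera])
```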
Training models that are robust to data domain shift has drawn great interest in both academia and industry. Question answering is one of the typical problems in natural language processing (NLP) research and has achieved great success with the advent of large transformer models. However, existing approaches mostly assume that data comes from the same distribution during training and testing, which is impractical in the wild. In this paper, we explore adversarial training approaches for learning domain-invariant features so that language models can generalize well to out-of-domain datasets. We also examine various other methods for improving model performance, including data augmentation by paraphrasing sentences, conditioning the end-of-answer prediction on the start word, and a carefully designed annealing function. Our initial results show that, combined with these methods, we are able to achieve a $15.2\%$ improvement in EM score and a $5.6\%$ boost in F1 score over the baseline. We also dissect the model outputs and visualize the model's hidden states by projecting them into a lower-dimensional space, and find that our specific adversarial training approach indeed encourages the model to learn domain-invariant embeddings and brings them closer together in the multi-dimensional space.
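One common way to instantiate adversarial training for domain-invariant features is a gradient-reversal domain discriminator in the style of DANN; the sketch below assumes that setup and a 768-dimensional encoder feature, which may differ from the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DomainDiscriminator(nn.Module):
    """Predicts which domain a QA encoder feature came from."""
    def __init__(self, hidden=768, n_domains=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(),
                                 nn.Linear(256, n_domains))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))

# Training-step sketch: total loss = QA span loss + adversarial domain loss.
# Because gradients through the reversed branch are flipped, minimizing the
# domain loss pushes the encoder toward domain-invariant features:
#   logits = discriminator(encoder_features, lam=annealed_lambda)
#   loss = span_loss + nn.functional.cross_entropy(logits, domain_labels)
```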
Uncertainty pervades the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probability distributions. Trajectory forecasting, in particular, is surrounded by uncertainty, as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical-distance-based loss function that encourages predicted uncertainties to better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions.
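A hedged sketch of the core idea follows: add a statistical-distance term that penalizes mismatch between the forecast's predicted Gaussian uncertainty and the upstream perception uncertainty. The squared 2-Wasserstein distance between diagonal Gaussians is used here purely for illustration and is not necessarily the paper's choice of distance.

```python
import torch

def gaussian_w2_sq(mu_p, sigma_p, mu_q, sigma_q):
    """Squared 2-Wasserstein distance between diagonal Gaussians, per sample."""
    return ((mu_p - mu_q) ** 2 + (sigma_p - sigma_q) ** 2).sum(dim=-1)

def uncertainty_aware_loss(pred_mu, pred_sigma, gt_traj, perc_mu, perc_sigma,
                           beta=0.1):
    # Standard Gaussian negative log-likelihood on the trajectory itself ...
    nll = (((pred_mu - gt_traj) / pred_sigma) ** 2 / 2
           + torch.log(pred_sigma)).sum(dim=-1)
    # ... plus a statistical-distance term that keeps the predicted
    # uncertainty consistent with the upstream perception uncertainty.
    dist = gaussian_w2_sq(pred_mu, pred_sigma, perc_mu, perc_sigma)
    return (nll + beta * dist).mean()
```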
Implicit Neural Representations (INR) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited, as they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12X, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6X decoding speed and scales well with more GPUs, making it practical for various deployment scenarios.
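A rough sketch of the autoregressive, group-wise fitting loop is shown below; the per-coordinate MLP stands in for NIRVANA's patch-wise networks, and the group size, architecture, and optimization schedule are placeholders (quantization-aware training is omitted).

```python
import copy
import torch
import torch.nn as nn

def sample_coords(group, n=4096):
    """Random (x, y, t) coordinates and their RGB targets from a frame group."""
    T, C, H, W = group.shape
    t = torch.randint(T, (n,))
    y = torch.randint(H, (n,))
    x = torch.randint(W, (n,))
    coords = torch.stack([x / W, y / H, t / max(T - 1, 1)], dim=-1).float()
    return coords, group[t, :, y, x]                 # (n, 3), (n, C)

def fit_video(frames, group_size=8, steps=500):
    """frames: (T, C, H, W) in [0, 1]. Fits one small network per group,
    warm-starting each group from the previous group's weights."""
    models, prev_state = [], None
    for start in range(0, frames.shape[0], group_size):
        group = frames[start:start + group_size]
        net = nn.Sequential(nn.Linear(3, 256), nn.ReLU(),
                            nn.Linear(256, frames.shape[1]))
        if prev_state is not None:
            net.load_state_dict(prev_state)           # autoregressive initialization
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        for _ in range(steps):
            coords, targets = sample_coords(group)
            loss = ((net(coords) - targets) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        prev_state = copy.deepcopy(net.state_dict())
        models.append(net)
    return models
```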
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprised of image, text, and tabular data.
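The following is a rough, hypothetical sketch of feature-space augmentation in the spirit of LeMDA: a small network jointly perturbs the per-modality feature vectors. The consistency and adversarial objectives that actually train the augmenter in the paper are omitted.

```python
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    """Learns a joint perturbation over concatenated per-modality features,
    regardless of which modalities are involved."""
    def __init__(self, total_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(total_dim, total_dim), nn.ReLU(),
                                 nn.Linear(total_dim, total_dim))

    def forward(self, feats):             # feats: list of (B, d_i) modality features
        joint = torch.cat(feats, dim=-1)
        delta = self.net(joint)
        out, i = [], 0
        for f in feats:                    # split the learned perturbation back
            out.append(f + delta[:, i:i + f.shape[-1]])
            i += f.shape[-1]
        return out

# Usage sketch: both the original and the augmented feature sets are fed to the
# downstream fusion head, so the task network sees learned augmentations.
# aug_feats = augmenter([img_feat, txt_feat, tab_feat])
```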
Monitoring water is a complex task due to its dynamic nature, added pollutants, and land build-up. The availability of high-resolution data from Sentinel-2 multispectral products makes implementing remote sensing applications feasible. However, overutilizing or underutilizing multispectral bands of the product can lead to inferior performance. In this work, we compare the performances of ten out of the thirteen bands available in a Sentinel-2 product for water segmentation using eight machine learning algorithms. We find that the shortwave infrared bands (B11 and B12) are the best suited for segmenting water bodies. B11 achieves an overall accuracy of $71\%$ while B12 achieves $69\%$ across all algorithms on the test site. We also find that the Support Vector Machine (SVM) algorithm is the most favourable for single-band water segmentation. The SVM achieves an overall accuracy of $69\%$ across the tested bands over the given test site. Finally, to demonstrate the effectiveness of choosing the right amount of data, we use only B11 reflectance data to train an artificial neural network, BandNet. Even with a basic architecture, BandNet performs comparably to known architectures for semantic and water segmentation, achieving a $92.47$ mIOU on the test site. BandNet requires only a fraction of the time and resources to train and run inference, making it suitable to be deployed on web applications to run and monitor water bodies in localized regions. Our codebase is available at https://github.com/IamShubhamGupto/BandNet.
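For illustration, a single-band SVM baseline of the kind compared in the paper can be set up as below; the file names, subsampling, and train/test split are assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hypothetical inputs: b11 is an (H, W) reflectance raster, mask a binary
# water/non-water label raster of the same shape.
b11 = np.load("b11_reflectance.npy")
mask = np.load("water_mask.npy")

X = b11.reshape(-1, 1).astype(np.float32)        # a single feature per pixel
y = mask.reshape(-1)

rng = np.random.default_rng(0)                   # subsample so the SVM stays tractable
idx = rng.permutation(len(X))[:20000]
X, y = X[idx], y[idx]

n = len(X) // 2                                  # naive split for illustration only
clf = SVC(kernel="rbf").fit(X[:n], y[:n])
print("overall accuracy:", accuracy_score(y[n:], clf.predict(X[n:])))
```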
In this paper, we discuss an imitation learning based method for reducing the calibration error for a mixed reality system consisting of a vision sensor and a projector. Unlike a head mounted display, in this setup, augmented information is available to a human subject via the projection of a scene into the real world. Inherently, the camera and projector need to be calibrated as a stereo setup to project accurate information in 3D space. Previous calibration processes require multiple recording and parameter tuning steps to achieve the desired calibration, which is usually a time-consuming process. In order to avoid such tedious calibration, we train a CNN model to iteratively correct the extrinsic offset given a QR code and a projected pattern. We discuss the overall system setup, data collection for training, and results of the auto-correction model.
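The auto-correction procedure can be pictured as the loop sketched below; the capture function, offset predictor, and additive extrinsics update are placeholders, not the paper's implementation.

```python
import numpy as np

def auto_correct(extrinsics, capture_fn, predict_offset_fn, steps=10, tol=1e-3):
    """Repeatedly project the pattern, capture the scene containing the QR
    code, let the CNN estimate the remaining extrinsic offset, and apply it."""
    for _ in range(steps):
        image = capture_fn(extrinsics)        # project pattern and grab a frame
        delta = predict_offset_fn(image)      # CNN-estimated extrinsic offset
        if np.linalg.norm(delta) < tol:       # converged: remaining offset is tiny
            break
        extrinsics = extrinsics + delta       # simple additive update (sketch only)
    return extrinsics
```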
Language-conditioned policies allow robots to interpret and execute human instructions. Learning such policies requires a substantial investment with regards to time and compute resources. Still, the resulting controllers are highly device-specific and cannot easily be transferred to a robot with different morphology, capability, appearance or dynamics. In this paper, we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots. By introducing a novel method, namely Hierarchical Modularity, and adopting supervised attention across multiple sub-modules, we bridge the divide between modular and end-to-end learning and enable the reuse of functional building blocks. In both simulated and real world robot manipulation experiments, we demonstrate that our method outperforms the current state-of-the-art methods and can transfer policies across 4 different robots in a sample-efficient manner. Finally, we show that the functionality of learned sub-modules is maintained beyond the training process and can be used to introspect the robot decision-making process. Code is available at https://github.com/ir-lab/ModAttn.
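As a loose sketch (assumptions throughout, not the released ModAttn code), hierarchical modularity with supervised attention can be pictured as a selector whose attention over reusable sub-modules is trained against ground-truth sub-task labels:

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    def __init__(self, dim, sub_modules):
        super().__init__()
        self.sub_modules = nn.ModuleList(sub_modules)      # reusable building blocks
        self.selector = nn.Linear(dim, len(sub_modules))   # attention over modules

    def forward(self, lang_obs):
        attn = torch.softmax(self.selector(lang_obs), dim=-1)             # (B, K)
        outs = torch.stack([m(lang_obs) for m in self.sub_modules], dim=1)  # (B, K, A)
        action = (attn.unsqueeze(-1) * outs).sum(dim=1)                   # (B, A)
        return action, attn   # attn can receive a supervised cross-entropy loss

# Supervising `attn` against which sub-skill should currently be active ties
# each sub-module to an interpretable function, which is what supports
# introspection and transfer of the building blocks to a new robot.
```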
This work presents a physics-informed deep learning-based super-resolution framework to enhance the spatio-temporal resolution of the solution of time-dependent partial differential equations (PDE). Prior works on deep learning-based super-resolution models have shown promise in accelerating engineering design by reducing the computational expense of traditional numerical schemes. However, these models heavily rely on the availability of high-resolution (HR) labeled data needed during training. In this work, we propose a physics-informed deep learning-based framework to enhance the spatial and temporal resolution of coarse-scale (both in space and time) PDE solutions without requiring any HR data. The framework consists of two trainable modules independently super-resolving the PDE solution, first in the spatial and then in the temporal direction. The physics-based losses are implemented in a novel way to ensure tight coupling between the spatio-temporally refined outputs at different times and improve framework accuracy. We analyze the capability of the developed framework by investigating its performance on an elastodynamics problem. It is observed that the proposed framework can successfully super-resolve (both in space and time) the low-resolution PDE solutions while satisfying physics-based constraints and yielding high accuracy. Furthermore, the analysis and obtained speed-up show that the proposed framework is well-suited for integration with traditional numerical methods to reduce computational complexity during engineering design.
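A hedged sketch of the physics-loss ingredient follows: the PDE residual of the super-resolved field is penalized via finite differences, which couples the spatially and temporally refined outputs. The paper studies elastodynamics; a scalar wave equation stands in here for brevity.

```python
import torch

def wave_residual_loss(u, dx, dt, c=1.0):
    """u: (T, H, W) super-resolved field. Returns the mean squared residual of
    u_tt - c^2 (u_xx + u_yy) on interior points, coupling space and time."""
    u_tt = (u[2:] - 2 * u[1:-1] + u[:-2]) / dt ** 2
    u_xx = (u[:, :, 2:] - 2 * u[:, :, 1:-1] + u[:, :, :-2]) / dx ** 2
    u_yy = (u[:, 2:] - 2 * u[:, 1:-1] + u[:, :-2]) / dx ** 2
    res = (u_tt[:, 1:-1, 1:-1]
           - c ** 2 * (u_xx[1:-1, 1:-1, :] + u_yy[1:-1, :, 1:-1]))
    return (res ** 2).mean()

# In training, a residual of this kind would be added to the data-fidelity
# terms so that the spatial and temporal super-resolution modules stay
# mutually consistent without requiring any high-resolution labels.
```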